Finding Biologically Accurate Clusterings in Hierarchical Decompositions Using the Variation of Information

Authors

  • Saket Navlakha
  • James White
  • Niranjan Nagarajan
  • Mihai Pop
  • Carl Kingsford
Abstract

Hierarchical clustering is a popular method for grouping together similar items based on a distance measure between them. These clusters can be used to infer annotations for uncharacterized items. However, in many cases, annotation information for some elements is known beforehand. We present a novel approach for decomposing a hierarchical clustering into the optimal clusters that match a set of known annotations, as measured by the variation of information metric. Our approach is general, and we apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For both applications, we test the quality of our clusters by using them to predict complex and species membership. We find that our approach generally outperforms the commonly used heuristic methods.
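As a rough illustration of the metric named in the abstract, the variation of information between two hard clusterings C and C' of the same items can be written as VI(C, C') = H(C | C') + H(C' | C). The sketch below computes that quantity from two label sequences. The function name `variation_of_information` and the label encoding are assumptions made for illustration; this is not the authors' decomposition algorithm, only the distance such a method would minimize against the known annotations.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two hard clusterings.

    Each argument is a sequence of cluster labels, one per item, in the
    same item order. The function name and this input encoding are
    illustrative assumptions, not taken from the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n == 0:
        return 0.0

    size_a = Counter(labels_a)               # cluster sizes in the first clustering
    size_b = Counter(labels_b)               # cluster sizes in the second clustering
    overlap = Counter(zip(labels_a, labels_b))  # sizes of pairwise intersections

    vi = 0.0
    for (a, b), n_ab in overlap.items():
        p_ab = n_ab / n
        # Contribution of this overlap to H(A|B) + H(B|A):
        # -p_ab * [log(p_ab / p_a) + log(p_ab / p_b)]
        vi -= p_ab * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi


if __name__ == "__main__":
    annotations = ["x", "x", "y", "y"]   # hypothetical known annotations
    candidate_1 = [0, 0, 1, 1]           # matches the annotations exactly
    candidate_2 = [0, 1, 0, 1]           # splits every annotated group
    print(variation_of_information(annotations, candidate_1))
    print(variation_of_information(annotations, candidate_2))
```

A lower value means the candidate clustering agrees more closely with the annotations; identical partitions give zero regardless of how the labels are named.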

Similar resources

Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information

Hierarchical clustering is a popular method for grouping together similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decomposing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation o...

Selecting Ensemble Members in Ensemble Clustering Using Voting

Clustering is the process of dividing a dataset into subsets called clusters, so that objects within a cluster are similar to each other and different from objects in other clusters. Many clustering algorithms based on different approaches have been developed. An effective option is to combine two or more of these algorithms to solve the clustering problem. Ensemb...

Temporal Hierarchical Clustering

We study hierarchical clusterings of metric spaces that change over time. This is a natural geometric primitive for the analysis of dynamic data sets. Specifically, we introduce and study the problem of finding a temporally coherent sequence of hierarchical clusterings from a sequence of unlabeled point sets. We encode the clustering objective by embedding each point set into an ultrametric spa...

Analysis and Optimization of Graph Decompositions by Lifted Multicuts

We study the set of all decompositions (clusterings) of a graph through its characterization as a set of lifted multicuts. This leads us to practically relevant insights related to the definition of classes of decompositions by must-join and must-cut constraints and related to the comparison of clusterings by metrics. To find optimal decompositions defined by minimum cost lifted multicuts, we e...

Comparing Clusterings by the Variation of Information

This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from clustering C to clustering C′. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings....

Publication date: 2008